# Simulating from a Bernoulli distribution
```r
set.seed(853)
bernoulli_draws <- rbinom(n = 20, size = 1, prob = 0.3)
bernoulli_draws
```

```
 [1] 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
```
SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research
Semester 1, 2026
Last updated: 2026-01-23
I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.
By the end of this week, you will be able to:
TSwD
ROS
Linear regression assumes a continuous outcome variable that can take any value on the real line.
But what if our outcome is binary?
The Problem
Linear regression can predict probabilities outside the range [0, 1]!
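To see the problem concretely, here is a minimal sketch (the simulated data and object names are illustrative, not from the slides): fitting an ordinary `lm()` to a 0/1 outcome, a so-called linear probability model, yields fitted "probabilities" outside [0, 1].

```r
set.seed(853)

# Binary outcome driven by a continuous predictor
x <- runif(200, min = 0, max = 100)
y <- rbinom(200, size = 1, prob = plogis(-10 + 0.2 * x))

# A linear probability model: lm() applied to a 0/1 outcome
lpm <- lm(y ~ x)

# Fitted values stray outside [0, 1] at the extremes of x
range(fitted(lpm))
```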
Binary outcomes follow the Bernoulli distribution:

\[y_i \sim \text{Bernoulli}(p_i), \qquad \Pr(y_i = 1) = p_i\]
The logit function transforms probabilities (bounded between 0 and 1) to values on the entire real line:
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]
Examples:

- \(\text{logit}(0.5) = \log(1) = 0\)
- \(\text{logit}(0.9) = \log(9) \approx 2.2\)
- \(\text{logit}(0.1) = \log(1/9) \approx -2.2\)
The ratio \(\frac{p}{1-p}\) is called the odds.
The inverse logit (or logistic) function converts values on the real line back to probabilities:
\[\text{logit}^{-1}(x) = \frac{e^x}{1 + e^x}\]
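Both transformations are built into R: `qlogis()` is the logit and `plogis()` is its inverse, so there is no need to code them by hand.

```r
# logit: probability -> log-odds
qlogis(0.5)          # log(0.5 / 0.5) = 0

# inverse logit: log-odds -> probability
plogis(0)            # exp(0) / (1 + exp(0)) = 0.5

# the two functions are inverses of each other
plogis(qlogis(0.3))  # recovers 0.3
```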
The logistic regression model predicts the probability that \(y = 1\):
\[\Pr(y_i = 1) = \text{logit}^{-1}(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots)\]
Equivalently:
\[\text{logit}(\Pr(y_i = 1)) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots\]
Key Insight
The coefficients \(\beta\) describe relationships on the log-odds scale, not the probability scale.
We use glm() with family = binomial:
Let’s simulate data on whether it’s a weekday based on the number of cars on the road:
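The slide's simulation code was not preserved; below is a sketch of how such data could be generated, using base R for self-containment (the slides display a tibble). The seed is an assumption, and the parameters are chosen to roughly match the model output shown below (n = 500, true curve plogis(-10 + 0.2 × num_cars)).

```r
set.seed(853)  # assumed seed, not from the original slides

# Number of cars observed on each of 500 days
traffic_data <- data.frame(
  num_cars = sample(1:100, size = 500, replace = TRUE)
)

# Weekday indicator drawn from a Bernoulli with a logistic-shaped probability
traffic_data$is_weekday <- rbinom(
  n = 500, size = 1,
  prob = plogis(-10 + 0.2 * traffic_data$num_cars)
)

# Fit the logistic regression with glm()
traffic_model <- glm(
  is_weekday ~ num_cars,
  data = traffic_data,
  family = binomial(link = "logit")
)
```

With different draws the estimates will differ slightly from the output printed below, but the shape of the fit is the same.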
```
# A tibble: 6 × 2
  num_cars is_weekday
     <int>      <dbl>
1        9          0
2       64          1
3       90          1
4       93          1
5       17          0
6       29          0
```
```
Call:
glm(formula = is_weekday ~ num_cars, family = binomial(link = "logit"), 
    data = traffic_data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -10.05861    1.19848  -8.393   <2e-16 ***
num_cars      0.20037    0.02311   8.669   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 689.94  on 499  degrees of freedom
Residual deviance: 152.98  on 498  degrees of freedom
AIC: 156.98

Number of Fisher Scoring iterations: 8
```
```r
traffic_data |>
  mutate(predicted_prob = predict(traffic_model, type = "response")) |>
  ggplot(aes(x = num_cars)) +
  geom_point(aes(y = is_weekday), alpha = 0.3) +
  geom_line(aes(y = predicted_prob), colour = "#1791e8", linewidth = 1.2) +
  labs(
    x = "Number of cars", y = "Probability (weekday)",
    title = "Logistic regression fit"
  ) +
  theme_minimal(base_size = 16)
```

Non-linearity
Logistic regression coefficients are on the log-odds scale. A unit change in \(x\) does not correspond to a constant change in probability.
The same coefficient produces different probability changes depending on where you are on the curve:
Quick Interpretation
Divide the logistic regression coefficient by 4 to get the maximum possible effect on probability.
The logistic curve is steepest at its centre (where \(p = 0.5\)), with maximum slope = \(\beta/4\).
Example: If \(\beta = 0.19\), then:
\[\frac{0.19}{4} = 0.0475 \approx 5\%\]
A one-unit increase in \(x\) corresponds to at most a 5 percentage-point increase in the probability.
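The rule comes from the calculus of the logistic curve: the slope of \(\Pr(y = 1)\) with respect to \(x\) is \(\beta\, p(1-p)\), and \(p(1-p)\) reaches its maximum of \(1/4\) at \(p = 0.5\). A quick numerical check:

```r
beta <- 0.19

# Slope of the probability curve at various points: beta * p * (1 - p)
p <- c(0.1, 0.3, 0.5, 0.7, 0.9)
slope <- beta * p * (1 - p)
round(slope, 4)

# The largest slope occurs at p = 0.5 and equals beta / 4 = 0.0475
max(slope)
```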
For more precise interpretation, evaluate the predicted probability at specific values of the predictor, here at 30, 50, and 70 cars:

```r
coef(traffic_model)
```

```
(Intercept)    num_cars 
-10.0586139   0.2003724 
```

```r
plogis(coef(traffic_model)[1] + coef(traffic_model)[2] * 30)
plogis(coef(traffic_model)[1] + coef(traffic_model)[2] * 50)
plogis(coef(traffic_model)[1] + coef(traffic_model)[2] * 70)
```

```
(Intercept) 
 0.01716716 
(Intercept) 
  0.4900034 
(Intercept) 
  0.9814299 
```
An alternative interpretation uses odds ratios. Exponentiating the coefficient gives the multiplicative change in odds:
```r
coef(traffic_model)["num_cars"]
```

```
 num_cars 
0.2003724 
```

```r
exp(coef(traffic_model)["num_cars"])
```

```
num_cars 
1.221858 
```
Interpretation: For each additional car, the odds of it being a weekday are multiplied by 1.22.
| Method | Result | Interpretation |
|---|---|---|
| Raw coefficient | \(\beta\) | Change in log-odds per unit \(x\) |
| Divide by 4 | \(\beta/4\) | Maximum probability change |
| Odds ratio | \(e^\beta\) | Multiplicative change in odds |
| Evaluate at \(\bar{x}\) | \(\text{logit}^{-1}(\ldots)\) | Probability at specific values |
Let’s examine a classic social science question: Does income predict voting for conservative parties?
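The data-generating code for this example is not shown; here is one way such data could be simulated. The seed and sample size are assumptions, and the true coefficients mirror those used in the extended simulation later in these slides (n = 1000, income measured on a 1 to 5 scale).

```r
set.seed(853)  # assumed seed, not from the original slides

voting_data <- data.frame(
  income = sample(1:5, size = 1000, replace = TRUE)  # income quintile
)

# Probability of a conservative vote rises with income on the logit scale
voting_data$vote_conservative <- rbinom(
  n = 1000, size = 1,
  prob = plogis(-1.4 + 0.33 * voting_data$income)
)

vote_model <- glm(
  vote_conservative ~ income,
  data = voting_data,
  family = binomial(link = "logit")
)
```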
```
Call:
glm(formula = vote_conservative ~ income, family = binomial(link = "logit"), 
    data = voting_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.38827    0.15950  -8.704  < 2e-16 ***
income       0.33935    0.04797   7.074 1.51e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1350.7  on 999  degrees of freedom
Residual deviance: 1298.4  on 998  degrees of freedom
AIC: 1302.4

Number of Fisher Scoring iterations: 4
```
Applying the divide-by-4 rule:

```r
coef(vote_model)["income"] / 4  # vote_model is the model fitted above
```

```
    income 
0.08483779 
```
Predicted probabilities across the income scale:

```r
new_data <- tibble(income = 1:5)
new_data |>
  mutate(probability = predict(vote_model, newdata = new_data, type = "response"))
```

```
# A tibble: 5 × 2
  income probability
   <int>       <dbl>
1      1       0.259
2      2       0.330
3      3       0.408
4      4       0.492
5      5       0.577
```
Each unit increase in income corresponds to an increase of at most roughly 8 percentage points in the probability of voting conservative.
Use predict() with type = "response" for probabilities:
```r
predict(vote_model, newdata = tibble(income = c(1, 3, 5)), type = "response")
```

```
        1         2         3 
0.2594333 0.4084894 0.5765163 
```
The marginaleffects package provides the same predictions in a tidy format: marginaleffects::predictions() returns predicted probabilities together with standard errors and confidence intervals.
The marginal effect tells us how the probability changes for a small change in \(x\); here it is evaluated at income = 1, 3, and 5:

```
 Estimate Std. Error
   0.0652    0.00651
   0.0820    0.01161
   0.0829    0.01054
```
Note
Notice the marginal effect varies across income levels due to the non-linearity of the logistic function.
Residuals for logistic regression are defined as:
\[\text{residual}_i = y_i - \hat{p}_i\]
Because \(y_i\) is either 0 or 1 while \(\hat{p}_i\) lies between 0 and 1, each residual can take only one of two values for a given \(\hat{p}_i\): \(-\hat{p}_i\) when \(y_i = 0\), or \(1 - \hat{p}_i\) when \(y_i = 1\). Raw residual plots are therefore striped and hard to read.
Solution: Binned Residuals
Divide observations into bins based on predicted probability, then plot the average residual in each bin.
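A minimal base R sketch of the idea (the simulated data, bin count, and object names are illustrative choices, not from the slides):

```r
set.seed(1)

# Illustrative data and model; any fitted binomial glm works the same way
x <- runif(500, 0, 100)
y <- rbinom(500, size = 1, prob = plogis(-5 + 0.1 * x))
model <- glm(y ~ x, family = binomial)

p_hat <- fitted(model)   # predicted probabilities
raw_resid <- y - p_hat   # raw residuals: y_i - p_hat_i

# Split observations into 10 equal-size bins by predicted probability,
# then average the residuals within each bin
bins <- cut(p_hat,
            breaks = quantile(p_hat, probs = seq(0, 1, by = 0.1)),
            include.lowest = TRUE)
binned_resid <- tapply(raw_resid, bins, mean)
round(binned_resid, 3)
```

If the model fits well, the binned averages should hover near zero with no clear pattern across bins.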
We can evaluate predictions by classifying at a threshold (typically 0.5):
```r
# Model accuracy: share of observations classified correctly at the 0.5 threshold
mean((predict(vote_model, type = "response") > 0.5) ==
       voting_data$vote_conservative)
```

```
[1] 0.602
```

```r
# Null-model accuracy: always predict the most common outcome
max(mean(voting_data$vote_conservative),
    1 - mean(voting_data$vote_conservative))
```

```
[1] 0.594
```
Compare to Null Model
Always compare your model’s accuracy to the null model (predicting the most common outcome for everyone).
A better evaluation metric is the log score (log-likelihood):
```r
p_hat <- predict(vote_model, type = "response")
y <- voting_data$vote_conservative

# Total log score: log-likelihood of the observed outcomes
sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))
```

```
[1] -649.1941
```

```r
# Average log score per observation
mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))
```

```
[1] -0.6491941
```
Note
Higher (less negative) log scores indicate better fit. The log score properly accounts for the probability predictions, not just the classifications.
Bootstrapping is a resampling technique to estimate uncertainty:
```r
# Simulate earnings data
set.seed(42)
earnings_data <- tibble(
  earn = rlnorm(200, meanlog = 10, sdlog = 0.5),
  male = rbinom(200, 1, 0.5)
) |>
  mutate(earn = if_else(male == 1, earn * 1.2, earn))

# Observed ratio of median earnings
observed_ratio <- median(earnings_data$earn[earnings_data$male == 0]) /
  median(earnings_data$earn[earnings_data$male == 1])
observed_ratio
```

```
[1] 0.76043
```

```r
# Bootstrap function: resample rows with replacement, recompute the ratio
boot_ratio <- function(data) {
  n <- nrow(data)
  boot_indices <- sample(n, replace = TRUE)
  boot_data <- data[boot_indices, ]
  median(boot_data$earn[boot_data$male == 0]) /
    median(boot_data$earn[boot_data$male == 1])
}

# Run bootstrap
n_sims <- 1000
boot_results <- replicate(n_sims, boot_ratio(earnings_data))

# Bootstrap standard error
sd(boot_results)
```

```
[1] 0.05569499
```
Extending the voting model with gender:
```r
# Add gender to our data
set.seed(42)
voting_data <- voting_data |>
  mutate(
    female = rbinom(n(), 1, 0.5),
    vote_conservative = rbinom(
      n(), 1,
      prob = plogis(-1.4 + 0.33 * income - 0.4 * female)
    )
  )

vote_model2 <- glm(vote_conservative ~ income + female,
                   data = voting_data,
                   family = binomial(link = "logit"))
summary(vote_model2)
```
```
Call:
glm(formula = vote_conservative ~ income + female, family = binomial(link = "logit"), 
    data = voting_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.23201    0.17285  -7.128 1.02e-12 ***
income       0.32146    0.04878   6.590 4.41e-11 ***
female      -0.50390    0.13551  -3.719    2e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1325.2  on 999  degrees of freedom
Residual deviance: 1265.2  on 997  degrees of freedom
AIC: 1271.2

Number of Fisher Scoring iterations: 4
```
Applying the divide-by-4 rule to each coefficient:

```r
coef(vote_model2) / 4
```

```
(Intercept)      income      female 
-0.30800219  0.08036449 -0.12597400 
```
Predicted probabilities for two illustrative profiles, a low-income woman (income = 1, female = 1) and a high-income man (income = 5, female = 0):

```r
b <- coef(vote_model2)
c(
  poor_female = plogis(b["(Intercept)"] + b["income"] * 1 + b["female"] * 1),
  rich_male   = plogis(b["(Intercept)"] + b["income"] * 5)
)
```

```
poor_female.(Intercept)   rich_male.(Intercept) 
              0.1955336               0.5927344 
```
An interaction allows the effect of one predictor to depend on the value of another:
```
Call:
glm(formula = vote_conservative ~ income * female, family = binomial(link = "logit"), 
    data = voting_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -1.25555    0.22118  -5.677 1.37e-08 ***
income         0.32924    0.06674   4.933 8.08e-07 ***
female        -0.45260    0.32867  -1.377    0.168    
income:female -0.01675    0.09780  -0.171    0.864    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1325.2  on 999  degrees of freedom
Residual deviance: 1265.2  on 996  degrees of freedom
AIC: 1273.2

Number of Fisher Scoring iterations: 4
```
Best Practice
Centre continuous predictors before including interactions to make coefficients more interpretable.
Refitting the interaction model with income centred at its mean:

```r
voting_data <- voting_data |>
  mutate(income_c = income - mean(income))

vote_model3 <- glm(vote_conservative ~ income_c * female,
                   data = voting_data,
                   family = binomial(link = "logit"))
coef(vote_model3)
```

```
    (Intercept)        income_c          female income_c:female 
    -0.29746851      0.32923891     -0.50132906     -0.01674654 
```
Now the female coefficient represents the gender difference at the mean income, rather than at income = 0, which makes it directly interpretable.
Summary:

- Logistic Regression: glm() with family = binomial
- Interpretation
- Model Evaluation
- Uncertainty

Next: Week 10, Count Models and Multilevel Modelling